Oct 12, 2022
Last time:
Today:
# Start with the usual imports
# We'll use these throughout
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from bs4 import BeautifulSoup
import requests
Many websites will reject a requests.get() call if you do not specify the User-Agent header as part of your GET request. This header lets the website identify who is making the request. You can find your browser's User-Agent value in the "Network" tab of your browser's developer tools: click on any request listed in this tab, go to the "Headers" tab, and you should see the "user-agent" value listed:
url = "https://www.phila.gov/programs/coronavirus-disease-2019-covid-19/updates/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.37"
result = requests.get(url, headers={"User-Agent": user_agent}) # NEW: Specify the "User-Agent" header
soup = BeautifulSoup(result.content, "html.parser")
Use the web inspector to identify the correct CSS selector (right click -> Inspect and then right click -> Copy -> Copy Selector)
selector = "#post-263624 > div.one-quarter-layout > div:nth-child(1) > div.medium-18.columns.pbxl > ul > li:nth-child(1)"
Select the element using the CSS selector and get the text:
avg = soup.select_one(selector).text
avg
'Average new cases per day: 177'
Split the string into words:
words = avg.split()
words
['Average', 'new', 'cases', 'per', 'day:', '177']
Get the last element and convert to an integer:
int(words[-1])
177
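The split-and-convert steps above can be wrapped into a small reusable helper (a sketch; the function name is our own):

```python
def trailing_int(text):
    """Extract the integer at the end of a stats string,
    e.g. "Average new cases per day: 177" -> 177."""
    return int(text.split()[-1])

trailing_int("Average new cases per day: 177")  # 177
```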
selector = "#post-263624 > div.one-quarter-layout > div:nth-child(1) > div.medium-18.columns.pbxl > p:nth-child(3) > em"
last_updated = soup.select_one(selector).text
last_updated
'Cases last updated: October 4, 2022\nHospitalizations last updated: September 28, 2022'
Break into lines:
lines = last_updated.splitlines()
lines
['Cases last updated: October 4, 2022', 'Hospitalizations last updated: September 28, 2022']
Split by the colon:
lines[0].split(":")
['Cases last updated', ' October 4, 2022']
last_updated_date = lines[0].split(":")[-1]
last_updated_date
' October 4, 2022'
Convert to a timestamp:
timestamp = pd.to_datetime(last_updated_date)
timestamp
Timestamp('2022-10-04 00:00:00')
timestamp.strftime("%B %-d, %Y")
'October 4, 2022'
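The whole "last updated" parse above can be condensed into a single loop. One caveat on formatting: the "%-d" code used above (day without zero-padding) only works on Linux/macOS; on Windows the equivalent is "%#d", while plain "%d" is portable but zero-padded. A self-contained sketch using the scraped string from above:

```python
import pandas as pd

last_updated = (
    "Cases last updated: October 4, 2022\n"
    "Hospitalizations last updated: September 28, 2022"
)

# One timestamp per line, keyed by the label before the colon
dates = {}
for line in last_updated.splitlines():
    label, _, date_str = line.partition(":")
    dates[label] = pd.to_datetime(date_str.strip())

dates["Cases last updated"]  # Timestamp('2022-10-04 00:00:00')

# "%d" is portable but zero-padded; "%-d" (Linux/macOS)
# or "%#d" (Windows) drops the padding
dates["Cases last updated"].strftime("%B %d, %Y")  # 'October 04, 2022'
```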
Even more: 101 Web Scraping Exercises
For each of the exercises, use the Web Inspector to inspect the structure of the relevant web page, and identify the HTML content you will need to scrape with Python.
A number of councilmembers have resigned in order to run for mayor in the spring. Let's find out how many seats are on Council and how many are currently vacant!
Determine two things:
Relevant URL: https://phlcouncil.com/council-members/
Hints:
Extract the following:
Note: we are looking for food-borne violations only, and not all restaurants listed will have food-borne violations listed
Relevant URL: http://data.inquirer.com/inspections
How do you scrape data that only appears after user interaction?
You'll need a web browser installed to use selenium, e.g., Firefox, Google Chrome, or Edge.
Selenium will open a web browser, load the page, and the browser will respond to the commands issued by selenium
# Import the webdriver from selenium
from selenium import webdriver
The initialization steps will depend on which browser you want to use!
If you are working on Binder, you'll need to use Firefox in "headless" mode, which prevents a browser window from opening.
If you are working locally, it's better to run with the default options — you'll be able to see the browser window open and change as we perform the web scraping.
# UNCOMMENT BELOW TO USE CHROME
# from webdriver_manager.chrome import ChromeDriverManager
# from selenium.webdriver.chrome.service import Service
# driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
If you are working on Binder, use the below code!
# UNCOMMENT BELOW IF ON BINDER
# from webdriver_manager.firefox import GeckoDriverManager
# from selenium.webdriver.firefox.service import Service
# options = webdriver.FirefoxOptions()
# IF ON BINDER, RUN IN "HEADLESS" MODE (NO BROWSER WINDOW IS OPENED)
# COMMENT THIS LINE IF WORKING LOCALLY
# options.add_argument("--headless")
# Initialize
# driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()), options=options)
# UNCOMMENT BELOW TO USE MICROSOFT EDGE
# from webdriver_manager.microsoft import EdgeChromiumDriverManager
# from selenium.webdriver.edge.service import Service
# driver = webdriver.Edge(service=Service(EdgeChromiumDriverManager().install()))
Strategy:
# Open the URL
url = "https://ujsportal.pacourts.us/CaseSearch"
driver.get(url)
We'll need to:
We'll use a selenium Select() object to change the dropdown. First, import By so we can find elements by CSS selector:
from selenium.webdriver.common.by import By
# Use the Web Inspector to get the css selector of the dropdown select element
dropdown_selector = "#SearchBy-Control > select"
# Select the dropdown by the element's CSS selector
dropdown = driver.find_element(By.CSS_SELECTOR, dropdown_selector)
from selenium.webdriver.support.ui import Select
# Initialize a Select object
dropdown_select = Select(dropdown)
Change the selected option to "Incident Number" (the police incident/complaint number):
# Set the selected text in the dropdown element
dropdown_select.select_by_visible_text("Incident Number")
# Get the input element for the DC number
incident_input_selector = "#IncidentNumber-Control > input"
incident_input = driver.find_element(By.CSS_SELECTOR, incident_input_selector)
# Clear any existing entry
incident_input.clear()
# Input our example incident number
incident_input.send_keys("1725088232")
# Submit the search
search_button_id = "btnSearch"
driver.find_element(By.ID, search_button_id).click()
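After clicking Search, the results load dynamically, so the table may not be present the instant the click returns. A hedged sketch of an explicit wait (the selector and timeout are assumptions; imports are deferred into the function so it can be defined without a live browser):

```python
def wait_for_results(driver, selector="#caseSearchResultGrid", timeout=10):
    """Block until the results table appears in the page, or raise
    TimeoutException after `timeout` seconds."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selector))
    )
```

Call `wait_for_results(driver)` between clicking the button and reading `driver.page_source`.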
Use the driver's page_source attribute to get the current HTML displayed on the page:
courtsSoup = BeautifulSoup(driver.page_source, "html.parser")
Parse the <table> element and each <tr> element within the table.
# Select the results container by its ID
results_table = courtsSoup.select_one("#caseSearchResultGrid")
# Get all of the <tr> rows inside the tbody element
# NOTE: we are using nested selections here!
results_rows = results_table.select("tbody > tr")
Example: The number of court cases
# Number of court cases
number_of_cases = len(results_rows)
print(f"Number of court cases: {number_of_cases}")
Number of court cases: 2
Example: Extract the text elements from the first row of the results
first_row = results_rows[0]
print(first_row.prettify())
<tr class="slide-active">
<td class="display-none">
1
</td>
<td class="display-none">
0
</td>
<td>
MC-51-CR-0030672-2017
</td>
<td>
Common Pleas
</td>
<td>
Comm. v. Velquez, Victor
</td>
<td>
Closed
</td>
<td>
10/13/2017
</td>
<td>
Velquez, Victor
</td>
<td>
09/05/1974
</td>
<td>
Philadelphia
</td>
<td>
MC-01-51-Crim
</td>
<td>
U0981035
</td>
<td>
1725088232-0030672
</td>
<td>
1725088232
</td>
<td class="display-none">
</td>
<td class="display-none">
</td>
<td class="display-none">
</td>
<td class="display-none">
</td>
<td>
<div class="grid inline-block">
<div>
<div class="inline-block">
<a class="icon-wrapper" href="/Report/CpDocketSheet?docketNumber=MC-51-CR-0030672-2017&dnh=%2FGgePQykMpAymRENgxLBzg%3D%3D" target="_blank">
<img alt="Docket Sheet" class="icon-size" src="https://ujsportal.pacourts.us/resource/Images/svg-defs.svg?v=3-Me4WMBYQPCgs0IdgGyzeTEx_qd5uveL0qyDZoiHPM#icon-document-letter-D" title="Docket Sheet"/>
<label class="link-text">
Docket Sheet
</label>
</a>
</div>
</div>
</div>
<div class="grid inline-block">
<div>
<div class="inline-block">
<a class="icon-wrapper" href="/Report/CpCourtSummary?docketNumber=MC-51-CR-0030672-2017&dnh=%2FGgePQykMpAymRENgxLBzg%3D%3D" target="_blank">
<img alt="Court Summary" class="icon-size" src="https://ujsportal.pacourts.us/resource/Images/svg-defs.svg?v=3-Me4WMBYQPCgs0IdgGyzeTEx_qd5uveL0qyDZoiHPM#icon-court-summary" title="Court Summary"/>
<label class="link-text">
Court Summary
</label>
</a>
</div>
</div>
</div>
</td>
</tr>
# Extract out all of the "<td>" cells from the first row
td_cells = first_row.select("td")
# Loop over each <td> cell
for cell in td_cells:
# Extract out the text from the <td> element
text = cell.text
# Print out text
if text != "":
print(text)
1
0
MC-51-CR-0030672-2017
Common Pleas
Comm. v. Velquez, Victor
Closed
10/13/2017
Velquez, Victor
09/05/1974
Philadelphia
MC-01-51-Crim
U0981035
1725088232-0030672
1725088232
Docket SheetCourt Summary
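Instead of printing, the cell text can be collected into a pandas DataFrame, one row per `<tr>`. A self-contained sketch — the miniature HTML table and the column names below are invented for illustration, but the selectors match the scraping above:

```python
import pandas as pd
from bs4 import BeautifulSoup

# A tiny stand-in for the search results table (invented data)
html = """
<table id="caseSearchResultGrid">
  <tbody>
    <tr><td>MC-51-CR-0030672-2017</td><td>Common Pleas</td><td>Closed</td></tr>
    <tr><td>MC-51-CR-0030673-2017</td><td>Common Pleas</td><td>Closed</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.select("#caseSearchResultGrid tbody > tr")

# One list of (stripped) cell texts per row
records = [[td.text.strip() for td in row.select("td")] for row in rows]
df = pd.DataFrame(records, columns=["docket_number", "court", "status"])
```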
driver.close()
Coined by Simon Willison in this blog post
Data is scraped daily, saved to a CSV file, and added to a Github repository
Data is then tweeted daily, providing an easily accessible record of homicides over time
Source code is available on Github at nickhand/phl-homicide-bot
Key features:
Example repo available at: https://github.com/MUSA-550-Fall-2022/covid-stats-bot
Use Github Actions to run this workflow once a day
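A scheduled workflow for this might look like the following sketch — the file path, script name, cron schedule, and commit message are all assumptions, not the example repo's actual configuration:

```yaml
# .github/workflows/scrape.yml (hypothetical)
name: Daily scrape

on:
  schedule:
    - cron: "0 12 * * *"  # run every day at 12:00 UTC

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
      - name: Run the scraper
        run: python scrape.py
      - name: Commit the updated data
        run: |
          git config user.name "github-actions"
          git config user.email "actions@users.noreply.github.com"
          git add data.csv
          git commit -m "Update data" || echo "No changes to commit"
          git push
```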
This will allow you to pass your Twitter API credentials to tweepy without compromising security or storing them in plaintext on Github!
Data is tracked and updated over time in data.csv
Info is also tweeted each time it is updated!
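The daily "scrape and append to CSV" step can be sketched in plain Python (the function name, column names, and file path are our own; the scraped value would come from the BeautifulSoup code earlier in the lecture):

```python
import csv
from datetime import date
from pathlib import Path

def record_daily_stat(value, path="data.csv"):
    """Append today's date and a scraped value to a CSV file.

    Writing the header only when the file is new keeps the file
    append-friendly for a daily scheduled job."""
    csv_path = Path(path)
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["date", "avg_new_cases"])
        writer.writerow([date.today().isoformat(), value])
```

Each run of the scheduled job calls `record_daily_stat(...)` with the freshly scraped number, then commits the updated data.csv back to the repository.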